Abstract

A lot of effort has been invested into characterizing the convergence rates of gradient-based algorithms for
non-linear convex optimization. Recently, motivated by large datasets and problems in machine learning, the interest
has shifted towards distributed optimization. In this work we present a distributed algorithm for strongly convex
constrained optimization. Each node in a network of $n$ computers converges to the optimum of a strongly convex,
$L$-Lipschitz continuous, separable objective at a rate $O\left(\frac{\log(\sqrt{n}T)}{T}\right)$, where $T$ is the number of iterations. This rate
is achieved in the online setting where the data is revealed one at a time to the nodes, and in the batch setting where
each node has access to its full local dataset from the start. The same convergence rate is achieved in expectation
when the subgradients used at each node are corrupted with additive zero-mean noise.

I. Introduction

In this work we focus on solving optimization problems of the form 

$$\text{minimize} \quad F(w) = \frac{1}{T} \sum_{t=1}^{T} f^t(w) \qquad (1)$$

where each function $f^1(w), f^2(w), \ldots,$ is convex over a convex set $\mathcal{W} \subseteq \mathbb{R}^d$. This formulation applies widely in
machine learning scenarios, where $f^t(w)$ measures the loss of model $w$ with respect to data point $t$, and $F(w)$ is the
average loss over $T$ data points. In particular, we are interested in the behavior of online distributed optimization
algorithms for this sort of problem as the number of data points $T$ tends to infinity. We describe a distributed
algorithm which, for strongly convex functions $f^t$, converges at a rate $O\left(\frac{\log(\sqrt{n}T)}{T}\right)$. To the best of our knowledge
this is the first distributed algorithm to achieve this convergence rate for constrained optimization without relying on
smoothness assumptions on the objective or non-trivial communication mechanisms between the nodes. The result
holds in both the online and the batch optimization settings.

When faced with a non-linear convex optimization problem, gradient-based methods can be applied to find the
solution. The behavior of these algorithms is well-understood in the single-processor (centralized) setting. Under
the assumption that the objective is $L$-Lipschitz continuous, projected gradient descent-type algorithms converge at
a rate $O(1/\sqrt{T})$ [2]. This rate is achieved both in an online setting where the $f^t$'s are revealed to the algorithm
sequentially and in the batch setting where all $f^t$ are known in advance. If the cost functions are also strongly convex
then gradient algorithms can achieve linear rates, $O(1/T)$, in the batch setting [3] and nearly-linear rates, $O(\log(T)/T)$,
in the online setting [4]. Under additional smoothness assumptions, such as Lipschitz continuous gradients, the same
rate of convergence can also be achieved by second-order methods in the online setting [5], [6], while accelerated
methods can achieve a quadratic rate in the batch setting; see [7] and references therein.

The aim of this work is to extend the aforementioned results to the distributed setting where a network of
processors jointly optimize a similar objective. Assuming the network is arranged as an expander graph with
constant spectral gap, for general convex cost functions that are only $L$-Lipschitz continuous, the rate at which
existing algorithms on a network of $n$ processors will all reach the optimum value is $O\left(\frac{\log(T\sqrt{n})}{\sqrt{T}}\right)$, i.e., similar
to the optimal single-processor algorithms up to a logarithmic factor [8], [9]. This is true both in a batch setting
and in an online setting, even when the gradients are corrupted by noise. The technique proposed in [10] makes
use of mini-batches to obtain asymptotic rates for online optimization of smooth cost functions that
have Lipschitz continuous gradients corrupted by bounded-variance noise, and for smooth strongly convex
functions. However, this technique requires that each node exchange messages with every other node at the end
of each iteration. Finally, if the objective function is strongly convex and three times differentiable, a distributed
version of Nesterov's accelerated method [11] achieves a rate of $O(\log(T)/T)$ for unconstrained problems in the
batch setting, but the dependence on $n$ is not characterized.

The algorithm presented in this paper achieves a rate $O\left(\frac{\log(\sqrt{n}T)}{T}\right)$ for strongly convex functions. Our formulation
allows for convex constraints in the problem and assumes the objective function is Lipschitz continuous and strongly
convex; no higher-order smoothness assumptions are made. Our algorithm works in both the online and batch
settings and it scales nearly linearly in the number of iterations for network topologies with fast information diffusion.
In addition, at each iteration nodes are only required to exchange messages with a subset of other nodes in the
network (their neighbors).

The rest of the paper is organized as follows. Section II introduces notation and formalizes the problem. Section III
describes the proposed algorithm and states our main results. These results are proven in Section IV, and Section V
extends the analysis to the case where gradients are noisy. Section VI presents the results of numerical experiments
illustrating the performance of the algorithm, and the paper concludes in Section VII.

II. Online Convex Optimization

Consider the problem of minimizing a convex function $F(w)$ over a convex set $\mathcal{W} \subseteq \mathbb{R}^d$. Of particular interest
is the setting where the algorithm sequentially receives noisy samples of the (sub)gradients of $F(w)$. This setting
arises in online loss minimization for machine learning when the data arrives as a stream and the (sub)gradient is
evaluated using an individual data point at each step. Suppose the $t$th data point $x(t) \in \mathcal{X} \subseteq \mathbb{R}^d$ is drawn
i.i.d. from an unknown distribution $\mathcal{D}$, and let $f^t(w) = f(w, x(t))$ denote the loss of this data point with respect
to a particular model $w$. In this setting one would like to find the model $w$ that minimizes the expected loss
$\mathbb{E}_{\mathcal{D}}[f(w, x)]$, possibly with the constraint that $w$ be restricted to a model space $\mathcal{W}$. Clearly, as $T \to \infty$, the
objective $F(w) = \frac{1}{T}\sum_{t=1}^{T} f^t(w) \to \mathbb{E}_{\mathcal{D}}[f(w, x)]$, and so if the data stream is finite this motivates minimizing the
empirical loss $F(w)$.

An online convex optimization algorithm observes a data stream $x(1), x(2), \ldots$, and sequentially chooses a
sequence of models $w(1), w(2), \ldots$, after each observation. Upon choosing $w(t)$, the algorithm receives a subgradient
$g(t) \in \partial f^t(w(t))$. The goal is for the sequence $w(1), w(2), \ldots$ to converge to a minimizer $w^*$ of $F(w)$.

The performance of an online optimization algorithm is measured in terms of the regret:

$$R(T) = \sum_{t=1}^{T} f^t(w(t)) - \min_{w \in \mathcal{W}} \sum_{t=1}^{T} f^t(w). \qquad (2)$$

The regret measures the gap between the cost accumulated by the online optimization algorithm over $T$ steps and
that of a single model chosen to minimize the total cost over all $T$ cost terms simultaneously. If the costs $f^t$ are allowed
to be arbitrary convex functions then it can be shown that the best achievable rate for any online optimization
algorithm is $\frac{R(T)}{T} = \Omega\left(\frac{1}{\sqrt{T}}\right)$, and this bound is also achievable [1]. The rate can be significantly improved if the
cost functions have more favourable properties.

A. Assumptions 

Assumption 1: We assume for the rest of the paper that each cost function $f^t(w) = f(w, x(t))$ is $\sigma$-strongly
convex for all $x(t) \in \mathcal{X}$; i.e., there is a $\sigma > 0$ such that for all $\theta \in [0, 1]$ and all $u, w \in \mathcal{W}$

$$f^t(\theta u + (1 - \theta) w) \le \theta f^t(u) + (1 - \theta) f^t(w) - \frac{\sigma}{2}\theta(1 - \theta)\|u - w\|^2. \qquad (3)$$

If each $f^t$ is $\sigma$-strongly convex, it follows that $F(w)$ is also $\sigma$-strongly convex. Moreover, if $F(w)$ is strongly
convex then it is also strictly convex, and so $F(w)$ has a unique minimizer which we denote by $w^*$.

Assumption 2: We also assume that the subgradients $g(t)$ of each cost function $f^t$ are bounded by a known
constant $L > 0$; i.e., $\|g(t)\| \le L$ where $\|\cdot\|$ is the ($\ell_2$) Euclidean norm.

B. Example: Training a Classifier 

For a specific example of this setup, consider the problem of training an SVM classifier using a hinge loss
with $\ell_2$ regularization. In this case, the data stream consists of pairs $\{x(t), y(t)\}$ such that $x(t) \in \mathcal{X}$ and
$y(t) \in \{-1, +1\}$. The goal is to minimize the misclassification error as measured by the $\ell_2$-regularized hinge loss.
Formally, we wish to find the $w^* \in \mathcal{W} \subseteq \mathbb{R}^d$ that solves

$$\underset{w \in \mathcal{W}}{\text{minimize}} \quad \frac{\sigma}{2}\|w\|^2 + \frac{1}{m}\sum_{t=1}^{m} \max\{0, 1 - y(t)\langle w, x(t)\rangle\} \qquad (4)$$

which is $\sigma$-strongly convex.¹ For these types of problems, using a single-processor stochastic gradient descent
algorithm, one can achieve $\frac{R(T)}{T} = O\left(\frac{\log T}{T}\right)$ [4] or $\frac{R(T)}{T} = O\left(\frac{1}{T}\right)$ [12] by using different update schemes.
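
As a concrete illustration of objective (4), the following is a minimal sketch of the regularized hinge loss and of one subgradient of a single cost term. The function names `svm_objective` and `svm_subgradient` are hypothetical and chosen only for this example; NumPy is assumed.

```python
import numpy as np

def svm_objective(w, X, y, sigma):
    """Objective (4): (sigma/2)||w||^2 + mean hinge loss over the rows of X."""
    margins = 1.0 - y * (X @ w)
    return 0.5 * sigma * (w @ w) + np.mean(np.maximum(0.0, margins))

def svm_subgradient(w, x, y, sigma):
    """One subgradient of f^t(w) = (sigma/2)||w||^2 + max(0, 1 - y <w, x>)."""
    g = sigma * w                      # gradient of the regularizer
    if 1.0 - y * (x @ w) > 0.0:        # hinge term is active at w
        g = g - y * x
    return g
```

When the hinge term is exactly at its kink, any convex combination of the two branches is a valid subgradient; the sketch simply picks the regularizer-only branch.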

C. Distributed Online Convex Optimization 

In this paper, we are interested in solving online convex optimization problems with a network of computers. 
The computers are organized as a network $G = (V, E)$ with $|V| = n$ nodes, and messages are only exchanged
between nodes connected with an edge in $E$.

Assumption 3: In this work we assume that $G$ is connected and undirected.

Each node $i$ receives a stream of data $x_i(1), x_i(2), \ldots$, similar to the serial case, and the nodes must collaborate
to minimize the network-wide objective

$$F(w) = \sum_{t=1}^{T} \sum_{i=1}^{n} f_i^t(w) \qquad (5)$$

where $f_i^t(w) = f(w, x_i(t))$ is the cost incurred at processor $i$ at time $t$. In the distributed setting, the definition of
regret is naturally extended to

$$R(T) = \sum_{t=1}^{T} \sum_{i=1}^{n} f(w_i(t), x_i(t)) - \min_{w \in \mathcal{W}} \sum_{t=1}^{T} \sum_{i=1}^{n} f(w, x_i(t)). \qquad (6)$$

For general convex cost functions, the distributed algorithm proposed in [8] has been proven to have an average
regret that decreases at a rate $O(1/\sqrt{T})$, similar to the serial case, and this result holds even when the algorithm receives
noisy, unbiased observations of the true subgradients at each step. In the next section, we present a distributed
algorithm that achieves a nearly-linear rate of decrease of the average regret (up to a logarithmic factor) when the
cost functions are strongly convex.

III. Algorithm 

Nodes must collaborate to solve the distributed online convex optimization problem described in the previous
section. To that end, the network is endowed with an $n \times n$ consensus matrix $P$ which respects the structure of $G$,
in the sense that $[P]_{ji} = 0$ if $(i, j) \notin E$. We assume that $P$ is doubly stochastic, although generalizations to the
case where $P$ is row stochastic or column stochastic (but not both) are also possible [13], [14].
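
The text does not prescribe how $P$ is built. One common construction for an undirected graph is the Metropolis weighting, sketched below as an assumption for illustration only; the helper name `metropolis_weights` is hypothetical.

```python
import numpy as np

def metropolis_weights(adj):
    """Build a symmetric, doubly stochastic P from a 0/1 adjacency matrix.

    Off-diagonal weights are nonzero only on edges, so P respects the graph.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j] > 0:
                P[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        P[i, i] = 1.0 - P[i].sum()     # self-weight absorbs the remainder
    return P

# Path graph 0 - 1 - 2 (nodes 0 and 2 are not neighbors).
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
P = metropolis_weights(adj)
```

Because the weights are symmetric and each row sums to one, the columns sum to one as well, which is the doubly stochastic property the analysis relies on.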

A detailed description of the proposed algorithm, distributed online gradient descent (DOGD), is given in
Algorithm 1. In the algorithm, each node performs a total of $T$ updates. One update involves processing a single
data point $x_i(t)$ at each processor. The updates are performed over $k$ rounds, and $T_s$ updates are performed in round
$s \le k$. The main steps within each round (lines 5-8) involve updating an accumulated gradient variable, $z_i^k(t)$, by
simultaneously incorporating the information received from neighboring nodes and taking a local gradient-descent-like
step. The accumulated gradient is projected onto the constraint set to obtain $w_i^k(t)$, where

$$\Pi_{\mathcal{W}}[z] = \underset{w \in \mathcal{W}}{\arg\min} \|w - z\| \qquad (7)$$

denotes the Euclidean projection of $z$ onto $\mathcal{W}$, and then this projected value is merged into a running average $\hat{w}_i^k$.
The step size parameter $a_k$ remains constant within each round, and the step size is reduced by half at the end of
each round. The number of updates per round doubles from one round to the next.

¹ Although the hinge loss itself is not strongly convex, adding a strongly convex regularizer makes the overall cost function strongly convex.

Algorithm 1 DOGD

 1: Initialize: $T_1 = \lceil 2/\sigma \rceil$, $a_1 = 1$, $k = 1$, $z_i^1(1) = w_i^1(1) = 0$
 2: while $\sum_{s=1}^{k} T_s \le T$ do
 3:     ▷ Each node $i$ repeats
 4:     for $t = 1$ to $T_k$ do
 5:         Send/receive $z_i^k(t)$ and $z_j^k(t)$ to/from neighbors
 6:         Obtain next subgradient $g_i(t) \in \partial_w f_i^t(w_i^k(t))$
 7:         $z_i^k(t+1) = \sum_{j=1}^{n} p_{ij} z_j^k(t) - a_k g_i(t)$
 8:         $w_i^k(t+1) = \Pi_{\mathcal{W}}[z_i^k(t+1)]$
 9:     end for
10:     $w_i^{k+1}(1) = w_i^k(T_k)$
11:     $z_i^{k+1}(1) = w_i^{k+1}(1)$
12:     $\hat{w}_i^{k+1} = \frac{1}{T_k} \sum_{t=1}^{T_k} w_i^k(t)$
13:     $T_{k+1} \leftarrow 2 T_k$
14:     $a_{k+1} \leftarrow a_k / 2$
15:     $k \leftarrow k + 1$
16: end while

Note that the algorithm proposed here differs from the distributed dual averaging algorithm described in [8], where
a proximal projection is used rather than the Euclidean projection. Also, in contrast to the distributed subgradient
algorithms described in [15], DOGD maintains an accumulated gradient variable in $z_i^k(t + 1)$ which is updated
using $\{z_j^k(t)\}$ as opposed to the primal feasible variables $\{w_j^k(t)\}$. Finally, key to achieving fast convergence is
the exponential decrease of the learning rate after performing an exponentially increasing number of gradient steps,
together with a proper initialization of the learning rate.
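
The round structure described above can be sketched in a few lines. This is a simulation on a single machine, not a networked implementation, and it rests on several assumptions made only for illustration: the constraint set is a Euclidean ball of radius `radius`, the callback `subgrad(i, w)` (a hypothetical name) returns node $i$'s next subgradient, and the per-round average is taken over the iterates $w_i^k(1), \ldots, w_i^k(T_k)$.

```python
import math
import numpy as np

def project_ball(z, radius):
    """Euclidean projection onto {w : ||w|| <= radius} (assumed constraint set)."""
    norm = np.linalg.norm(z)
    return z if norm <= radius else z * (radius / norm)

def dogd(P, subgrad, d, T, sigma, radius=1.0):
    """Sketch of DOGD; returns each node's average over the last completed round."""
    n = P.shape[0]
    T_k, a_k = math.ceil(2.0 / sigma), 1.0
    z = np.zeros((n, d))
    w = np.zeros((n, d))
    w_avg = w.copy()
    used = 0
    while used + T_k <= T:                      # only start rounds that fit in T steps
        w_sum = np.zeros((n, d))
        for _ in range(T_k):
            w_sum += w                          # accumulate w_i^k(t)
            w_t = w
            g = np.stack([subgrad(i, w[i]) for i in range(n)])
            z = P @ z - a_k * g                 # consensus step plus local gradient step
            w = np.stack([project_ball(z[i], radius) for i in range(n)])
        w_avg = w_sum / T_k
        w = w_t                                 # restart from w_i^k(T_k)
        z = w.copy()                            # and reset z to the restarted w
        used += T_k
        T_k, a_k = 2 * T_k, a_k / 2.0           # double the round length, halve the step
    return w_avg
```

For instance, with quadratic costs $f_i(w) = \frac{\sigma}{2}\|w - c_i\|^2$ the subgradient callback is `lambda i, w: sigma * (w - c[i])`, and the returned averages should approach the mean of the $c_i$.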

The next section provides theoretical guarantees on the performance of DOGD.

IV. Convergence Analysis

Our main convergence result, stated below, guarantees that the average regret decreases at a rate which is nearly
linear.

Theorem 1: Let Assumptions 1-3 hold and suppose that the consensus matrix $P$ is doubly stochastic with constant
$\lambda_2$. Let $w^*$ be the minimizer of $F(w)$. Then the sequence $\{\hat{w}_i^k\}$ produced by nodes running DOGD to minimize
$F(w)$ obeys

$$F(\hat{w}_i^k) - F(w^*) = O\left(\frac{\log(\sqrt{n}T)}{T}\right), \qquad (8)$$

where $k = \lfloor \log_2(T/2 + 1) \rfloor$ is the number of rounds executed during a total of $T$ gradient steps per node, and $\hat{w}_i^k$
is the running average maintained locally at each node.
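
To make the round count concrete, the schedule of round lengths and step sizes can be tabulated directly. The sketch below assumes the initialization $T_1 = \lceil 2/\sigma \rceil$, $a_1 = 1$ with doubling and halving between rounds; the helper name `dogd_schedule` is hypothetical.

```python
import math

def dogd_schedule(T, sigma, a1=1.0):
    """List (k, T_k, a_k) for every round that fits within T steps per node."""
    T_k, a_k, used, k = math.ceil(2.0 / sigma), a1, 0, 1
    rounds = []
    while used + T_k <= T:
        rounds.append((k, T_k, a_k))
        used += T_k
        T_k, a_k, k = 2 * T_k, a_k / 2.0, k + 1
    return rounds

rounds = dogd_schedule(T=2000, sigma=1.0)
```

In this example, with $\sigma = 1$, the number of rounds coincides with $\lfloor \log_2(T/2 + 1) \rfloor$, and the product $a_k T_k$ stays constant from round to round.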

Remark 1: We state the result for the case where $\lambda_2$ is constant. This is the case when $G$ is, e.g., a complete graph
or an expander graph [16]. For other graph topologies where the spectral gap shrinks with $n$ and consensus does not converge
fast, the dependence of the convergence rate on $n$ is worse due to a factor $1 - \sqrt{\lambda_2}$ in the denominator; see
the proof of Theorem 1 below for the precise dependence on the spectral gap.

Remark 2: The theorem characterizes the performance of the online algorithm DOGD, where the data and cost
functions $f_i^t$ are processed sequentially at each node in order to minimize an objective of the form

$$F(w) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{T} \sum_{t=1}^{T} f_i^t(w). \qquad (9)$$

However, as pointed out in [4], if the entire dataset is available in advance, we can use the same scheme to do
batch minimization by effectively setting $f_i^t(w) = f_i(w)$, where $f_i(w)$ is the objective function accounting for the
entire dataset available to node $i$. Thus, the same result holds immediately for a batch version of DOGD.

The remainder of this section is devoted to the proof of Theorem 1. Our analysis follows arguments that can be
found in [1], [8], [12] and references therein. We first state and prove some intermediate results.

A. Properties of Strongly Convex Functions 

Recall the definition of $\sigma$-strong convexity given in Assumption 1. A direct consequence of this definition is that
if $F(w)$ is $\sigma$-strongly convex then

$$F(w) - F(w^*) \ge \frac{\sigma}{2} \|w - w^*\|^2. \qquad (10)$$

Strong convexity can be combined with the assumptions above to upper bound the difference $F(w) - F(w^*)$ for
an arbitrary point $w \in \mathcal{W}$.

Lemma 1: Let $w^*$ be the minimizer of $F(w)$. For all $w \in \mathcal{W}$, we have $F(w) - F(w^*) \le \frac{2L^2}{\sigma}$.

Proof: For any subgradient $g$ of $F$ at $w$, by convexity we know that $F(w) - F(w^*) \le \langle g, w - w^* \rangle$. It
follows from Assumption 2 that $F(w) - F(w^*) \le L \|w - w^*\|$. Furthermore, from Assumption 1 we obtain that
$\frac{\sigma}{2} \|w - w^*\|^2 \le L \|w - w^*\|$ or $\|w - w^*\| \le \frac{2L}{\sigma}$. As a result, $F(w) - F(w^*) \le \frac{2L^2}{\sigma}$. ■



July 23, 2012 



DRAFT 






B. The Lazy Projection Algorithm 

The analysis of DOGD below involves showing that the average state, $\frac{1}{n}\sum_{i=1}^{n} w_i(t)$, evolves according to the
so-called (single-processor) lazy projection algorithm [1], which we discuss next. The lazy projection algorithm is
an online convex optimization scheme for the serial problem discussed at the beginning of Section II. A single
processor sequentially chooses a new variable $w(t)$ and receives a subgradient $g(t)$ of $f(w(t), x(t))$. The algorithm
chooses $w(t + 1)$ by repeating the steps

$$z(t + 1) = z(t) - a g(t) \qquad (11)$$

$$w(t + 1) = \Pi_{\mathcal{W}}[z(t + 1)]. \qquad (12)$$

By unwrapping the recursive form of (11), we get

$$z(t + 1) = -a \sum_{s=1}^{t} g(s) + z(1). \qquad (13)$$

The following is a typical result for subgradient descent-style algorithms, and is useful towards eventually
characterizing how the regret accumulates. Its proof can be found in the appendix of the extended version of [1].

Theorem 2 (Zinkevich [1]): Let $w(1) \in \mathcal{W}$, let $a > 0$, and set $z(1) = w(1)$. After $T$ rounds of the serial lazy
projection algorithm (11)-(12), we have

$$\sum_{t=1}^{T} \langle g(t), w(t) - w^* \rangle \le \frac{\|w(1) - w^*\|^2}{2a} + \frac{T a L^2}{2}. \qquad (14)$$

Theorem 2 immediately yields the same bound for the regret of lazy projection [1].
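
The two-step recursion (11)-(12) is short enough to write out. The sketch below is an illustration under assumptions: `subgrad` and `project` are hypothetical caller-supplied callables, and the function returns the iterates $w(1), \ldots, w(T+1)$.

```python
import numpy as np

def lazy_projection(subgrad, project, w1, a, T):
    """Run z(t+1) = z(t) - a g(t), w(t+1) = Pi_W[z(t+1)] for T steps."""
    z = np.array(w1, dtype=float)      # z(1) = w(1)
    w = z.copy()
    iterates = [w.copy()]
    for _ in range(T):
        g = subgrad(w)
        z = z - a * g                  # accumulate in the unprojected variable
        w = project(z)                 # project only to produce the next model
        iterates.append(w.copy())
    return iterates
```

The point of the "lazy" variant is that the projection never feeds back into $z$; only the model $w$ handed to the cost function is constrained.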

C. Evolution of Network-Average Quantities in DOGD

We turn our attention to Algorithm 1. A standard approach to studying convergence of distributed optimization
algorithms, such as DOGD, is to keep track of the discrepancy between every node's state and an average state
sequence defined as

$$\bar{z}^k(t) = \frac{1}{n} \sum_{i=1}^{n} z_i^k(t) \quad \text{and} \quad \bar{w}^k(t) = \Pi_{\mathcal{W}}[\bar{z}^k(t)]. \qquad (15)$$

Observe that $\bar{z}^k(t)$ evolves in a simple recursive manner,

\begin{align}
\bar{z}^k(t + 1) &= \frac{1}{n} \sum_{i=1}^{n} z_i^k(t + 1) \tag{16}\\
&= \frac{1}{n} \sum_{i=1}^{n} \left[ \sum_{j=1}^{n} p_{ij} z_j^k(t) - a_k g_i(t) \right] \tag{17}\\
&= \frac{1}{n} \sum_{j=1}^{n} \left( \sum_{i=1}^{n} p_{ij} \right) z_j^k(t) - \frac{a_k}{n} \sum_{i=1}^{n} g_i(t) \tag{18}\\
&= \bar{z}^k(t) - \frac{a_k}{n} \sum_{i=1}^{n} g_i(t) \tag{19}\\
&= -a_k \sum_{s=1}^{t} \frac{1}{n} \sum_{i=1}^{n} g_i(s) + \frac{1}{n} \sum_{i=1}^{n} z_i^k(1) \tag{20}
\end{align}

where equation (19) holds since $P$ is doubly stochastic. Notice (cf. eqn. (13)) that the states $\{\bar{z}^k(t), \bar{w}^k(t)\}$ evolve
according to the lazy projection algorithm with gradients $\bar{g}(t) = \frac{1}{n} \sum_{i=1}^{n} g_i(t)$ and learning rate $a_k$. In the sequel,
we will also use an analytic expression for $z_i^k(t)$ derived by back substituting in its recursive update equation. After
some algebraic manipulation, we obtain

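$$z_i^k(t) = -a_k \sum_{s=1}^{t-1} \sum_{j=1}^{n} [P^{t-s+1}]_{ij}\, g_j(s - 1) - a_k g_i(t - 1) + \sum_{j=1}^{n} [P^{t}]_{ij}\, z_j^k(1) \qquad (21)$$

and since the projection is non-expansive and $z_i^1(1) = 0, \forall i$,

\begin{align}
\|w_i^{k+1}(1)\| &= \|z_i^{k+1}(1)\| = \|w_i^k(T_k)\| = \|\Pi_{\mathcal{W}}[z_i^k(T_k)]\| \tag{22}\\
&\le \|z_i^k(T_k)\| \tag{23}\\
&\le \left\| -a_k \sum_{s=1}^{T_k - 1} \sum_{j=1}^{n} [P^{T_k-s+1}]_{ij}\, g_j(s - 1) \right\| + \|-a_k g_i(T_k - 1)\| + \sum_{j=1}^{n} [P^{T_k}]_{ij} \|z_j^k(1)\| \tag{24}\\
&\le a_k T_k L + \sum_{j=1}^{n} [P^{T_k}]_{ij} \|z_j^k(1)\| \tag{25}\\
&\le \cdots \tag{26}\\
&\le L \sum_{s=1}^{k} a_s T_s. \tag{27}
\end{align}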


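D. Analysis of One Round of DOGD

Next, we focus on bounding the amount of regret accumulated during the $k$th round of DOGD (lines 4-9 of
Algorithm 1) during which the learning rate remains fixed at $a_k$. Using Assumptions 1, 2 and the triangle inequality
we have that

\begin{align}
\sum_{t=1}^{T_k} \left[ F(w_i^k(t)) - F(w^*) \right]
&= \sum_{t=1}^{T_k} \left[ F(\bar{w}^k(t)) - F(w^*) + F(w_i^k(t)) - F(\bar{w}^k(t)) \right] \tag{28}\\
&\le \sum_{t=1}^{T_k} \left[ F(\bar{w}^k(t)) - F(w^*) + L \|w_i^k(t) - \bar{w}^k(t)\| \right] \tag{29}\\
&= \sum_{t=1}^{T_k} \frac{1}{n} \sum_{j=1}^{n} \left[ f_j(\bar{w}^k(t)) - f_j(w^*) \right] + \sum_{t=1}^{T_k} L \|w_i^k(t) - \bar{w}^k(t)\| \tag{30}\\
&\le \underbrace{\sum_{t=1}^{T_k} \frac{1}{n} \sum_{j=1}^{n} \langle g_j(t), w_j^k(t) - w^* \rangle}_{A_1}
 + \sum_{t=1}^{T_k} \frac{1}{n} \sum_{j=1}^{n} L \|w_j^k(t) - \bar{w}^k(t)\|
 + \sum_{t=1}^{T_k} L \|w_i^k(t) - \bar{w}^k(t)\|. \tag{31}
\end{align}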

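For the first summand we have

\begin{align}
A_1 &= \sum_{t=1}^{T_k} \frac{1}{n} \sum_{i=1}^{n} \langle g_i(t), w_i^k(t) - w^* \rangle \tag{32}\\
&= \sum_{t=1}^{T_k} \frac{1}{n} \sum_{i=1}^{n} \langle g_i(t), \bar{w}^k(t) - w^* \rangle
 + \sum_{t=1}^{T_k} \frac{1}{n} \sum_{i=1}^{n} \langle g_i(t), w_i^k(t) - \bar{w}^k(t) \rangle \tag{33}\\
&\le \underbrace{\sum_{t=1}^{T_k} \frac{1}{n} \sum_{i=1}^{n} \langle g_i(t), \bar{w}^k(t) - w^* \rangle}_{A_2}
 + \sum_{t=1}^{T_k} \frac{1}{n} \sum_{i=1}^{n} L \|w_i^k(t) - \bar{w}^k(t)\|. \tag{34}
\end{align}

To bound term $A_2$ we invoke Theorem 2 for the average sequences $\{\bar{w}^k(t)\}$ and $\{\bar{z}^k(t)\}$.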

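\begin{align}
A_2 &= \sum_{t=1}^{T_k} \frac{1}{n} \sum_{i=1}^{n} \langle g_i(t), \bar{w}^k(t) - w^* \rangle \tag{35}\\
&= \sum_{t=1}^{T_k} \left\langle \frac{1}{n} \sum_{i=1}^{n} g_i(t), \Pi_{\mathcal{W}}[\bar{z}^k(t)] - w^* \right\rangle \tag{36}\\
&= \sum_{t=1}^{T_k} \langle \bar{g}(t), \Pi_{\mathcal{W}}[\bar{z}^k(t)] - w^* \rangle \tag{37}\\
&\le \frac{\|\bar{w}^k(1) - w^*\|^2}{2 a_k} + \frac{T_k a_k \left\| \frac{1}{n} \sum_{i=1}^{n} g_i(t) \right\|^2}{2} \tag{38}\\
&\le \frac{\|\bar{w}^k(1) - w^*\|^2}{2 a_k} + \frac{T_k a_k L^2}{2}.
\end{align}

Collecting now all the partial results and bounds, so far we have shown that

\begin{align}
\sum_{t=1}^{T_k} \left[ F(w_i^k(t)) - F(w^*) \right]
&\le \frac{\|\bar{w}^k(1) - w^*\|^2}{2 a_k} + \frac{T_k a_k L^2}{2} \tag{39}\\
&\quad + \sum_{t=1}^{T_k} \frac{2}{n} \sum_{i=1}^{n} L \|w_i^k(t) - \bar{w}^k(t)\| + \sum_{t=1}^{T_k} L \|w_i^k(t) - \bar{w}^k(t)\| \tag{40}
\end{align}

and since the projection operator is non-expansive, we have

$$\sum_{t=1}^{T_k} \left[ F(w_i^k(t)) - F(w^*) \right] \le \frac{\|\bar{w}^k(1) - w^*\|^2}{2 a_k} + \frac{T_k a_k L^2}{2} + \sum_{t=1}^{T_k} \frac{2}{n} \sum_{i=1}^{n} L \|z_i^k(t) - \bar{z}^k(t)\| + \sum_{t=1}^{T_k} L \|z_i^k(t) - \bar{z}^k(t)\|. \qquad (41)$$

The first two terms are standard for subgradient algorithms using a constant step size. The last two terms depend
on the error between each node's iterate $z_i^k(t)$ and the network-wide average $\bar{z}^k(t)$, which we bound next.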



E. Bounding the Network Error 

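What remains is to bound the term $\|z_i^k(t) - \bar{z}^k(t)\|$ which describes an error induced by the network since the
different nodes do not agree on the direction towards the optimum. By recalling that $P$ is doubly stochastic and
manipulating the recursive expressions (21) and (20) for $z_i^k(t)$ and $\bar{z}^k(t)$ using arguments similar to those in [8],
[14], we obtain the bound,

\begin{align}
\|z_i^k(t) - \bar{z}^k(t)\| &\le a_k L \sum_{s=1}^{t-1} \sum_{j=1}^{n} \left| \frac{1}{n} - [P^{t-s+1}]_{ij} \right| + \sum_{j=1}^{n} \left| \frac{1}{n} - [P^{t}]_{ij} \right| \|z_j^k(1)\| + 2 a_k L \tag{42}\\
&= a_k L \sum_{s=1}^{t-1} \left\| \frac{1}{n} \mathbf{1}^T - [P^{t-s+1}]_{i,:} \right\|_1 + 2 a_k L + \sum_{j=1}^{n} \left| \frac{1}{n} - [P^{t}]_{ij} \right| \|z_j^k(1)\|. \tag{43}
\end{align}

The $\ell_1$ norm can be bounded using Lemma 2, which is stated and proven in the Appendix, and using (27) we arrive
at

$$\|z_i^k(t) - \bar{z}^k(t)\| \le 2 a_k L \frac{\log(T_k \sqrt{n})}{1 - \sqrt{\lambda_2}} + 3 a_k L + \frac{L \sum_{s=1}^{k-1} a_s T_s}{T_k} \qquad (44)$$

where $\lambda_2$ is the second largest eigenvalue of $P$. Using this bound in equation (41), along with the fact that $F(w)$
is convex, we conclude that

\begin{align}
F(\hat{w}_i^{k+1}) - F(w^*) &= F\left( \frac{1}{T_k} \sum_{t=1}^{T_k} w_i^k(t) \right) - F(w^*) \tag{45}\\
&\le \frac{1}{T_k} \sum_{t=1}^{T_k} \left[ F(w_i^k(t)) - F(w^*) \right] \tag{46}\\
&\le \frac{\|\bar{w}^k(1) - w^*\|^2}{2 a_k T_k} + \frac{a_k L^2}{2} + L^2 a_k \left[ \frac{6 \log(T_k \sqrt{n})}{1 - \sqrt{\lambda_2}} + 9 \right] + \frac{3 L^2 \sum_{s=1}^{k-1} a_s T_s}{T_k} \tag{47}
\end{align}

where $\bar{w}^k(1) = \Pi_{\mathcal{W}}[\bar{z}^k(1)]$.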



F. Analysis of DOGD over Multiple Rounds

As our last intermediate step, we must control the learning rate and update of $T_k$ from round to round to ensure
linear convergence of the error. From strong convexity of $F$ we have

$$\|\bar{w}^k(1) - w^*\|^2 \le \frac{2 \left( F(\bar{w}^k(1)) - F(w^*) \right)}{\sigma} \qquad (48)$$

and thus

$$F(\hat{w}_i^{k+1}) - F(w^*) \le \frac{F(\bar{w}^k(1)) - F(w^*)}{\sigma a_k T_k} + \frac{L^2 a_k}{2} \left[ \frac{12 \log(T_k \sqrt{n})}{1 - \sqrt{\lambda_2}} + 19 \right] + \frac{3 L^2 \sum_{s=1}^{k-1} a_s T_s}{T_k}. \qquad (49)$$

Now, from Theorem 3 in [?], which is a direct consequence of Theorem 2 for the average sequence $\bar{w}$ viewed as a
single-processor lazy projection algorithm, we have that after executing $T_{k-1}$ gradient steps in round $k - 1$,



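$$F(\bar{w}^k(1)) - F(w^*) \le \frac{\|\bar{w}^{k-1}(1) - w^*\|^2}{2 a_{k-1} T_{k-1}} + \frac{a_{k-1} L^2}{2} \qquad (50)$$

and by repeatedly using strong convexity and Theorem 2 we see that

\begin{align}
F(\bar{w}^k(1)) - F(w^*) &\le \frac{F(\bar{w}^{k-1}(1)) - F(w^*)}{\sigma a_{k-1} T_{k-1}} + \frac{a_{k-1} L^2}{2} \tag{51}\\
&\le \cdots \tag{52}\\
&\le \frac{F(\bar{w}^1(1)) - F(w^*)}{\prod_{j=1}^{k-1} (\sigma a_{k-j} T_{k-j})} + \sum_{j=1}^{k-1} \frac{a_{k-j} L^2}{2 \prod_{s=1}^{j-1} (\sigma a_{k-s} T_{k-s})}. \tag{53}
\end{align}

Now, let us fix positive integers $b$ and $c$, and suppose we use the following rules to determine the step size and
number of updates performed within each round:

\begin{align}
a_k &= \frac{a_{k-1}}{b} = \cdots = \frac{a_1}{b^{k-1}} \tag{54}\\
T_k &= c\, T_{k-1} = \cdots = c^{k-1} T_1. \tag{55}
\end{align}

Combining (53) with (49) and invoking Lemma 1 we have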

F{w k+1 ) - F(w*) < 



(54) 
(55) 



2L 2 



fc-i 

E 



ai L 2 



L 2 ai 



fe-s-l 



12 iog(r lC *-Vn) ig 



1 - v/Aa 

s-1 



(56) 



Tic*- 1 

To ensure convergence to zero, we need $c \ge b$ and $\sigma a_1 T_1 > 1$ or $a_1 > \frac{1}{\sigma T_1}$. Given these restrictions, let us make
the choices

$$a_1 = 1, \quad T_1 = \frac{2}{\sigma}, \quad c = b = 2. \qquad (57)$$

To simplify the exposition, let us assume that $T_1 = \frac{2}{\sigma}$ is an integer. Using the selected values, we obtain

2L 2 



Fiw^ 1 ) ~F{w*) < 



fc-i 

E 

3=1 



-n^ 2(d 



Ml: (2(l) fc " s - 1 

logd^^- 1 ^ 



2-2 fe -! 



12- 



19 



< 



2 fe-i 
it-i 



L 2 
2 k 



12 



19 



< 



2L 2 L 2 (k-1) 



< 



a2 k 
L 2 

+ 

2L 2 
I? 



12 



2 fc 



19 



L 2 (/c-l) 



+ 



2 A: 



12 



2 fc 

, /2 fc v/riN 



19 



3L 2 (fc-l) 
2 Frl 



6L 2 (fc- 1) 



6L 2 (/s- 1) 
2 fe ' 



Finally, we have all we need to complete the analysis of Algorithm 1.



(58) 



(59) 



(60) 



(61) 



G. Proof of Theorem 1

Suppose we run Algorithm 1 for $T$ total steps at each node. This allows for $k$ rounds, where $k$ is determined by
solving

k k 



^T,<T^^2-2 4 <T^fc< log 2 ( I + 1 ) . 



(62)
Using this value for $k$ we see that



F(wj +1 ) - F(w*) < — 2*+ L2{k l? 
a 



2 k 



1?_ 
2* 



log ( ) 

12 &y a ' + 19 

1 - Vxi 



6L 2 (k-l) 



2 ~k 



< 



L 2 , L 2 (log 2 (| + l)-l) 



(l + i) 



log 



(1+1) 

(f+l)Vn 



12- 



l - 



19 



6L 2 ((f + !)-!) 
(f + 1) 

(63)

$$= O\left( \frac{\log(\sqrt{n} T)}{T (1 - \sqrt{\lambda_2})} \right) = O\left( \frac{\log(\sqrt{n} T)}{T} \right) \qquad (64)$$

when $\lambda_2$ is constant and does not scale with $n$, and this concludes the proof of Theorem 1.



V. Extension to Stochastic Optimization 

The proof presented in the previous section can easily be extended to the case where each node receives a random
estimate $\hat{g}(t)$ of the gradient, satisfying $\mathbb{E}[\hat{g}(t)] = g(t)$, instead of receiving $g(t)$ directly. We assume that noisy
gradients still have bounded variance, i.e., $\mathbb{E}[\|\hat{g}_i(t)\|^2] \le L^2$. In this setting, instead of equation (35), we have

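\begin{align}
A_2 &= \sum_{t=1}^{T_k} \frac{1}{n} \sum_{i=1}^{n} \langle g_i(t), \bar{w}^k(t) - w^* \rangle \tag{65}\\
&= \sum_{t=1}^{T_k} \frac{1}{n} \sum_{i=1}^{n} \langle \hat{g}_i(t), \bar{w}^k(t) - w^* \rangle
 + \sum_{t=1}^{T_k} \frac{1}{n} \sum_{i=1}^{n} \langle g_i(t) - \hat{g}_i(t), \bar{w}^k(t) - w^* \rangle. \tag{66}
\end{align}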



However, the proof of Theorem 2 does not depend on the gradients being correct; rather, it holds for noisy gradients
$\hat{g}_i(t)$ as well. Moreover, we have $\mathbb{E}[\|\hat{g}_i(t)\|] \le L$, and by Hölder's inequality $\mathbb{E}[\|\hat{g}_i(t)\| \|\hat{g}_j(t)\|] \le L^2$. Thus,

$$\mathbb{E}\left[ \left\| \frac{1}{n} \sum_{i=1}^{n} \hat{g}_i(t) \right\|^2 \right] \le \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbb{E}\left[ \|\hat{g}_i(t)\| \|\hat{g}_j(t)\| \right] \le L^2. \qquad (67)$$

Thus, invoking Theorem 2, if the new data and thus the subgradients are independent of the past, and since
$\mathbb{E}[\hat{g}_i(t)] = g_i(t)$, we have

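\begin{align}
\mathbb{E}[A_2] &\le \frac{\|\bar{w}^k(1) - w^*\|^2}{2 a_k} + \frac{T_k a_k L^2}{2}
 + \sum_{t=1}^{T_k} \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[ \langle g_i(t) - \hat{g}_i(t), \bar{w}^k(t) - w^* \rangle \right] \tag{68}\\
&= \frac{\|\bar{w}^k(1) - w^*\|^2}{2 a_k} + \frac{T_k a_k L^2}{2}
 + \sum_{t=1}^{T_k} \frac{1}{n} \sum_{i=1}^{n} \left\langle \mathbb{E}[g_i(t) - \hat{g}_i(t)], \mathbb{E}[\bar{w}^k(t) - w^*] \right\rangle \tag{69}\\
&= \frac{\|\bar{w}^k(1) - w^*\|^2}{2 a_k} + \frac{T_k a_k L^2}{2}. \tag{70}
\end{align}

Furthermore, the network error bound holds in expectation as well, i.e.,

$$\mathbb{E}\left[ \|w_i^k(t) - \bar{w}^k(t)\| \right] \le \mathbb{E}\left[ \|z_i^k(t) - \bar{z}^k(t)\| \right] \le 2 a_k L \frac{\log(T_k \sqrt{n})}{1 - \sqrt{\lambda_2}} + 3 a_k L + \frac{L \sum_{s=1}^{k-1} a_s T_s}{T_k}. \qquad (71)$$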



Collecting all these observations we have shown that, in expectation, 

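$$\mathbb{E}\left[ F(\hat{w}_i^{k+1}) - F(w^*) \right] \le \frac{\|\bar{w}^k(1) - w^*\|^2}{2 a_k T_k} + \frac{a_k L^2}{2} + L^2 a_k \left[ \frac{6 \log(T_k \sqrt{n})}{1 - \sqrt{\lambda_2}} + 9 \right] + \frac{3 L^2 \sum_{s=1}^{k-1} a_s T_s}{T_k} \qquad (72)$$

which, after using the update rules for $a_k$ and $T_k$, is exactly the same rate as before. We note however that
there may still be room for improvement in the distributed stochastic optimization setting since [12] describes a
single-processor algorithm that converges at a rate $O\left(\frac{1}{T}\right)$.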

VI. Simulation 

To illustrate the performance of DOGD we simulate online training of a classifier by solving the problem (4)
using a network of 10 nodes arranged as a random geometric graph. Each node is given $T = 600$ data points, and
the input dimension is $d = 100$. We set $\sigma = 0.1$ and generate the data from a standard normal distribution and
classify them as $-1$ or $1$ depending on their relative position to a randomly drawn hyperplane in $\mathbb{R}^d$. As we see
in Figure 1, DOGD minimizes the objective much faster than Distributed Dual Averaging (DDA) [8], which has
a convergence rate of $O\left(\frac{\log(T\sqrt{n})}{\sqrt{T}}\right)$. DDA is simulated using the learning rate that is suggested in [8]. We have
observed that boosting this learning rate may yield faster convergence, but still not as fast as DOGD. Figure 1 also
shows the performance of a version of Fast Distributed Gradient Descent (FDGD) [11]. As we can see, FDGD fails
to converge in an online or stochastic setting and ends up oscillating.

Fig. 1. Optimization of a $d = 100$ dimensional problem of the form (4) with a random network of 10 nodes. Our proposed algorithm
DOGD (red) converges faster than DDA (green), as expected from the $T$ instead of $\sqrt{T}$ in the denominator of the convergence rate bound.
FDGD (black) is unable to converge in the online problem.
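
The data-generation step of this experiment can be sketched as follows. This is not the authors' simulation code: the helper name `make_stream`, the seed, and the choice of a hyperplane through the origin are assumptions made for illustration.

```python
import numpy as np

def make_stream(n_nodes, T, d, seed=0):
    """Draw standard normal inputs and label them by a random hyperplane."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=d)                     # normal vector of the hyperplane (assumed through the origin)
    X = rng.normal(size=(n_nodes, T, d))       # X[i, t] is the t-th data point at node i
    y = np.where(X @ h >= 0.0, 1.0, -1.0)      # label is the side of the hyperplane
    return X, y, h

X, y, h = make_stream(n_nodes=10, T=600, d=100)
```

Each node $i$ would then consume the pairs `(X[i, t], y[i, t])` one at a time as its local data stream.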



VII. Future Work 

In this paper we have proposed and analyzed a novel distributed optimization algorithm which we call Distributed 
Online Gradient Descent (DOGD). Our analysis shows that DOGD converges at a rate $O\left(\frac{\log(\sqrt{n}T)}{T}\right)$ when solving
online, stochastic or batch constrained convex optimization problems if the objective function is strongly convex. 
This rate is optimal in the number of iterations for the online and batch setting and slower than a serial algorithm 
only by a logarithmic factor in the stochastic optimization setting. 

In its current form, DOGD requires the nodes in the network to exchange gradient information at every iteration.
Our preliminary investigation suggests that gradually performing more and more updates between each
communication can speed up distributed optimization algorithms in the batch setting when one explicitly accounts for the
time required to communicate data. Our future work will carry out a similar analysis for online and stochastic
optimization algorithms.



Appendix 

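Lemma 2: If $P$ is a doubly stochastic matrix defined over a strongly connected graph $G = (V, E)$ with $|V| = n$
nodes so that $p_{ji} = 0$ if $(i, j) \notin E$, then for any $t \le T$,

$$\sum_{s=1}^{t-1} \left\| \frac{1}{n} \mathbf{1}^T - [P^{t-s+1}]_{i,:} \right\|_1 \le 2 \frac{\log(T \sqrt{n})}{1 - \sqrt{\lambda_2}} + 1 \qquad (73)$$

where $\lambda_2$ is the second largest eigenvalue of $P$.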

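Proof: If the consensus matrix $P$ is doubly stochastic it is straightforward to show that $P^t \to \frac{1}{n} \mathbf{1}\mathbf{1}^T$ as $t \to \infty$.
Moreover, from standard Perron-Frobenius theory it is easy to show (see e.g., [17])

$$\left\| \frac{1}{n} \mathbf{1}^T - [P^{t}]_{i,:} \right\|_1 \le \sqrt{n} \left( \sqrt{\lambda_2} \right)^{t} \qquad (74)$$

so in our case $\left\| \frac{1}{n} \mathbf{1}^T - [P^{t-s+1}]_{i,:} \right\|_1 \le \sqrt{n} (\sqrt{\lambda_2})^{t-s+1}$. Next, demand that the right hand side bound is less than
$\sqrt{n}\delta$ with $\delta$ to be determined:

$$\sqrt{n} (\sqrt{\lambda_2})^{t-s+1} \le \sqrt{n} \delta \quad \Longrightarrow \quad t - s + 1 \ge \frac{\log(\delta^{-1})}{\log(\lambda_2^{-1/2})}. \qquad (75)$$

So with the choice $\delta^{-1} = \sqrt{n} T$,

$$\left\| \frac{1}{n} \mathbf{1}^T - [P^{t-s+1}]_{i,:} \right\|_1 \le \sqrt{n} \frac{1}{\sqrt{n} T} = \frac{1}{T} \qquad (76)$$

if $t - s + 1 \ge \frac{\log(T \sqrt{n})}{\log(\lambda_2^{-1/2})} = \hat{t}$. When $s$ is large and $t - s + 1 < \hat{t}$ we take
$\left\| \frac{1}{n} \mathbf{1}^T - [P^{t-s+1}]_{i,:} \right\|_1 \le 2$. The desired bound is now obtained as follows

\begin{align}
\sum_{s=1}^{t-1} \left\| \frac{1}{n} \mathbf{1}^T - [P^{t-s+1}]_{i,:} \right\|_1
&= \sum_{s=1}^{t-\hat{t}-1} \left\| \frac{1}{n} \mathbf{1}^T - [P^{t-s+1}]_{i,:} \right\|_1 + \sum_{s=t-\hat{t}}^{t-1} \left\| \frac{1}{n} \mathbf{1}^T - [P^{t-s+1}]_{i,:} \right\|_1 \tag{77}\\
&\le \sum_{s=1}^{t-\hat{t}-1} \frac{1}{T} + \sum_{s=t-\hat{t}}^{t-1} 2 \tag{78}\\
&\le \frac{t - \hat{t} - 1}{T} + 2 \hat{t} \le 1 + 2 \hat{t}. \tag{79}
\end{align}

Since $t \le T$ we know that $t - \hat{t} - 1 < T$. Moreover, $\log(\lambda_2^{-1/2}) \ge 1 - \sqrt{\lambda_2}$. Using these two facts we arrive at the
result. The same bound is true for any individual entry of $P^t$ approaching $\frac{1}{n}$. ■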
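
The geometric decay that the proof builds on can be observed numerically. The sketch below checks a standard bound of this type for a symmetric doubly stochastic matrix, namely that the $\ell_1$ distance of each row of $P^t$ from uniform is at most $\sqrt{n}\rho^t$, where $\rho$ is taken here to be the second largest eigenvalue magnitude of $P$. That convention for $\rho$, the example matrix, and the helper name `mixing_errors` are assumptions for illustration and need not match the paper's $\lambda_2$ exactly.

```python
import numpy as np

def mixing_errors(P, t_max):
    """Return max_i ||[P^t]_{i,:} - 1/n||_1 for t = 1, ..., t_max."""
    n = P.shape[0]
    Pt = np.eye(n)
    errs = []
    for _ in range(t_max):
        Pt = Pt @ P
        errs.append(np.abs(Pt - 1.0 / n).sum(axis=1).max())
    return np.array(errs)

# Lazy random walk on a 4-cycle: symmetric and doubly stochastic.
P = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
n = P.shape[0]

eig_mags = np.sort(np.abs(np.linalg.eigvalsh(P)))[::-1]
rho = eig_mags[1]                              # second largest eigenvalue magnitude
errs = mixing_errors(P, 20)
bound = np.sqrt(n) * rho ** np.arange(1, 21)   # sqrt(n) * rho^t
```

On this example the error sits below the bound for every $t$ and shrinks by the same factor $\rho$ at each step, which is the behavior that lets the sum in (73) be split into a long tail of small terms and a short head of terms bounded by 2.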